For discussion: "slack" compression option #15215
@rincebrain Have you had a look at this?
For what it is worth, my opinion is that optimizing ZLE should be preferred. After all, if LZ4 is so fast, why should zero elimination be slower? I understand the concerns about buffer copies, but if the end result is faster and simpler, I cannot see any real issue.
The presence of such a trivial algorithm makes me wish even more for the ability to run compressor/decompressor code directly on an ABD instead of a linear buffer. In this trivial case even one copy looks excessive, and two are just wasteful. Our benchmarks show a NAS currently using 10-20 times more memory bandwidth than the data read/written. We should care more about it.
Just a thought about possible alternatives: couldn't we make all our decompressors report how much data they have actually filled, and then just fill the rest with zeroes up to the logical size? This way we could run the slack check before any normal compressor and simply tell the compressor to ignore the zeroes. Slack handling would then be done on the ABD, without needing to linearise the zero part of the buffer or run a more advanced compressor on it.
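As a rough illustration of that alternative, here is a minimal sketch in plain C. Everything in it is hypothetical: `decomp_fn_t`, `copy_decompress()` and `decompress_and_pad()` are made-up names, not OpenZFS interfaces, and it works on linear buffers rather than ABDs. It only shows the shape of "decompressor reports how much it filled, caller zeroes the rest up to the logical size".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical decompressor signature (an assumption, not an OpenZFS
 * interface): returns the number of bytes written to dst, or (size_t)-1
 * on failure.
 */
typedef size_t (*decomp_fn_t)(const void *src, size_t s_len,
    void *dst, size_t d_len);

/* Stand-in "decompressor" for the example: a plain copy. */
static size_t
copy_decompress(const void *src, size_t s_len, void *dst, size_t d_len)
{
	if (s_len > d_len)
		return ((size_t)-1);
	memcpy(dst, src, s_len);
	return (s_len);
}

/* Run the decompressor, then zero-fill whatever it did not produce. */
static int
decompress_and_pad(decomp_fn_t fn, const void *src, size_t s_len,
    void *dst, size_t lsize)
{
	size_t filled = fn(src, s_len, dst, lsize);

	if (filled == (size_t)-1 || filled > lsize)
		return (-1);
	memset((char *)dst + filled, 0, lsize - filled);
	return (0);
}
```

With this shape the zero tail never has to pass through the compressor at all; only the padding step needs to know the logical size.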
{
	(void) level;
	ASSERT3U(d_len, >=, s_len);
	memcpy(dst, src, s_len);
We should wipe the rest with zeroes here.
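One way to read that suggestion, as a minimal standalone sketch: the function name and signature below are assumptions for illustration (the real function in the patch takes a `level` argument and uses `ASSERT3U`), and plain `assert()` stands in for the ZFS assertion macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the decompress function under review.
 * Copies the s_len stored bytes, then zeroes the remainder of the
 * d_len-byte destination so the caller never sees stale data.
 */
static int
slack_decompress_sketch(const void *src, void *dst, size_t s_len,
    size_t d_len)
{
	assert(d_len >= s_len);
	memcpy(dst, src, s_len);
	memset((char *)dst + s_len, 0, d_len - s_len);
	return (0);
}
```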
The "slack" option simply searches from the end of the block backwards to the last non-zero byte, and sets that position as the "compressed" size. This patch is highly experimental; please see the associated PR for discussion. Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.
Motivation and Context
We have a customer that deals in very large, generally uncompressible data. They run with `compression=off` and lately have been considering `recordsize=16M`. In testing, we found that this presented a real bottleneck for files that only fill a small amount of the last record, eg, a 17M file.
The main reason for this is simply that we’re doing a ton more writing IO than we need to: we write all those trailing zeroes that we don’t even need. There’s also substantial extra CPU load, eg a lot of time spent checksumming.
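To put a number on the 17M example, here is a small arithmetic sketch. The helper name is hypothetical, and it assumes the file is larger than one record, so that every record (including the last) is a full `recordsize` block.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of trailing zero padding in the last
 * record of a file_size-byte file when every record is recordsize
 * bytes long.
 */
static uint64_t
tail_slack_bytes(uint64_t file_size, uint64_t recordsize)
{
	assert(recordsize != 0);

	uint64_t tail = file_size % recordsize;

	return (tail == 0 ? 0 : recordsize - tail);
}
```

For a 17 MiB file with 16 MiB records, `tail_slack_bytes(17ULL << 20, 16ULL << 20)` gives 15 MiB: the second record holds 1 MiB of data followed by 15 MiB of zeroes, all of which is written and checksummed when compression is off.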
So, we’ve been experimenting with various ways of ignoring the “slack space” of trailing zeroes at the end of each record. We call this “slack compression”.
There was some interest in this at the last OpenZFS Leadership call and maybe even doing it outside of the compression system. So, I’m posting this PR as a starting point for discussion!
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Method
The method is pretty simple no matter how it’s hooked up:
- In `zio_write_compress()`, call a function that will search the data buffer to find the last non-zero byte in the buffer. This value is the `psize` for the write.
- In `zio_read_bp_init()`, push a transform that will set `psize` to the `lsize`.
Yes, this is basically describing compression 😄
The `slack_compress()` function shows a fairly simple implementation of the search. It runs backwards through the data buffer as an array of `uint64_t`. It exits at the first non-zero value it finds. This implementation is loosely modeled on `zio_compress_zeroed_cb()` (indeed, our earliest experiments were simply exposing `ZIO_COMPRESS_EMPTY`). There are obvious optimisation opportunities available here, including vectorisation and letting it work on a scatter ABD to avoid the need to linearise the incoming buffer.
The interesting part is how and when to use this functionality.
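For readers who have not opened the diff, the backwards search described above can be sketched roughly as follows. This is not the PR's `slack_compress()`: the name `slack_search()` is made up, and the sketch assumes a linear, 8-byte-aligned buffer whose length is a multiple of `sizeof (uint64_t)`. It also ignores any rounding of the result that the real IO pipeline may need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the number of leading bytes of buf that
 * must be kept, i.e. the length up to and including the last non-zero
 * 64-bit word. Returns 0 for an all-zero buffer.
 */
static size_t
slack_search(const void *buf, size_t len)
{
	assert(len % sizeof (uint64_t) == 0);

	const uint64_t *words = buf;
	size_t n = len / sizeof (uint64_t);

	/* Walk backwards and stop at the first non-zero word. */
	while (n > 0 && words[n - 1] == 0)
		n--;

	return (n * sizeof (uint64_t));
}
```

The cost is proportional to the length of the zero tail, so a full block exits after a single comparison.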
`compression=slack` (this PR)
This PR shows the simplest form: adding a `compression=slack` option. This works just fine as a substitute for `compression=empty`, and the ARC gets the benefit as well.
Early on, I plumbed this method directly into `zio_write_compress` rather than hooking it up to `zio_compress_table`, because we’re not actually doing any compression, so we don’t need a destination buffer or a copy; the source buffer can be reused. As it turns out it only made a small difference to performance because the search function still requires a linear buffer.
This works just fine. It could be merged today (if we had any spare `DMU_BACKUP_FEATURE_*` bits). But it’s not really how I would like to do it.
`compression=none` and force psize to lsize
I am probably wrong about this, but as far as I could tell, there is only one situation outside of compression where we store `psize < lsize`, and that is when the ZIL is stored on a non-raidz vdev. Before issuing IO, `zio_shrink()` is called to set `psize` as small as possible. On read, the data is read into an `lsize`'d buffer, so it naturally falls out right. I haven’t tried hard to find the code, but my intuition is that this doesn’t work for raidz because the block gets sliced up further and that goes wrong if the sizes aren’t right coming in.
If that’s right, that should mean that we could set a transform in `zio_read_bp_init()` for any uncompressed block with `psize < lsize` and it would “just work”, without requiring a separate compression option. It would still need a read feature flag as older software won’t understand it (perhaps only active if written to a raidz vdev), but possibly won’t require a stream feature flag.
We wouldn’t have to, but we could also teach the ARC about this, so that even though compression is “off” it could still keep only the real data portion in memory rather than larger buffers of zeroes.
Slack with content compression?
We didn’t try this, but it might be possible to use slack with data compression. This might be worthwhile, as collapsing zeroes could perhaps be made faster than a general compressor and may still have less overhead.
I’ve not thought hard about this, but I think the method should be as simple as collapsing zeroes before invoking the real compressor, and when reading back, resetting psize after decompression. However, I think it would require that compression algorithms are able to know their decompressed size from the compressed state, since we can’t tell them ahead of time.
If that’s no good, then I can see taking a spare bit from somewhere in the BP (maybe bit 7 of the checksum type?) and use that as a signal that a further transform is necessary after decompression. Hmm, or maybe not, because where is the size? Do we need the final size for read even? Ok, I would really need to think about that more.
Slack compress the last block only?
We also spent some time trying to see if we could only trigger slack compression for the last block in an object, for when we know that the “inner” blocks are full and so pointless to search, but it seems to be a non-starter. If a file is written in order from start to finish and the ZIL isn’t used, then it’s usually possible to have a clear notion of “last block” which can be plumbed down through the write policy. If the block is written out to the ZIL though (and later hooked up with `WR_INDIRECT`) then there’s no way to know that it was the “last” block. Since our customer has an `fsync()`-heavy workload, most writes were of this type. For workloads that aren’t calling `fsync()` a lot it can be better, but there are also problems if holes or zero-fills are introduced to “inner” blocks: those blocks could benefit from slack compression but would no longer get it. Taken together it ends up being a very situational gain, so we haven’t pursued it further.
Make ZLE faster?
We considered that perhaps we could avoid this by optimising ZLE. The main difference is that ZLE is a true compression algorithm, and so copies the data. It could certainly be optimised, but regardless, it would require the entire block to be scanned, so we’d be going from a full scan and checksum to a full scan, copying some amount and then another shorter scan for the checksum. We think the best case for slack will be no copies, scanning backwards to find the end of data, and then a forward scan to compute the checksum, so at the end, the entire block has only been seen once and no copies made.
(That said, ZLE is ripe for some optimising work, though I doubt many people are using it directly these days)
Performance
We have a test scaffold available that runs a representative simulation of the customer’s workload: `open()`, `write()`, `fsync()`, `close()`. We use this to test the effects of config changes, patches, etc.
Below are results for varying combinations of `recordsize` and `compression` at different file sizes. The test is done on three 12-wide raidz3 pools, each with 10 writing threads, for 10 minutes. Note that there is a lot of other tuning involved (standard for the customer workload), so these are useful for ballpark comparison only.
The “throughput” column is showing the “real” amount of file data, not the total written, which helps compare the total loss due to trailing zeroes. The “fsync” column is the average time for the `fsync()` call to return, which is a useful proxy for the total IO time (and an important number for our customer as their clients are blocked until this returns).
These results have mostly given us confidence that even this quite naive version helps bring us closer to where we were with `none`, and can even rival `lz4` some of the time. Our next steps are likely to see what we can gain by removing or minimising the number of allocations and copies required, and by improving the performance of the searching function.
Feedback
We’ll keep working on this, but we’re interested in feedback on any or all of the above. If you’ve got workloads you can test this on, or might like us to try, let us know. If you have thoughts about what this might mean if it wasn’t a compression option but “always on”, great! (especially in “weird” IO situations, like gang blocks, or metadata, or whatever). I’d like to hear more about what it all might mean for block pointer and IO pipeline changes. And I dunno what else; if you were interested in the dev call, please throw some thoughts down too!